Conversation

wli51 (Collaborator) commented Sep 22, 2025

Summary

This PR adds a preprocessing notebook for the DepMap PRISM secondary drug repurposing dataset, producing a clean, deduplicated table of drug–cell line IC50 values for downstream agentic experiments.

Key Changes

  1. Config file + template

    • Introduces a gitignored config.yml for global configuration, including the location of the downloaded PRISM data and API keys for language models. A config.yml.template is checked in in place of the ignored config.yml.
  2. Notebook for deduplicating, merging screens, and generating summary visualizations of the dataset

    • Resolves duplicate drug–cell line pairs within each screen (HTS002, MTS010).
    • Prioritizes entries with the highest curve-fit quality (r²); see the sketch after this list.
    • Tabulates dataset composition by tissue and cell line.
  3. Trivial pytest to ensure the script version of the notebook runs.
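A minimal sketch of the deduplication idea in pandas (the file and column names here are illustrative assumptions, not necessarily those used in the notebook):

    import pandas as pd

    # hypothetical input file and column names
    df = pd.read_csv("prism_secondary_screen.csv")

    # keep the best-fitting curve (highest r2) per screen / drug / cell line pair
    deduped = (
        df.sort_values("r2", ascending=False)
        .drop_duplicates(subset=["screen_id", "drug_id", "cell_line_id"], keep="first")
    )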


Comment on lines 38 to 47
try:
    from IPython import get_ipython
    shell = get_ipython().__class__.__name__
    if shell == 'ZMQInteractiveShell':
        print("Running in Jupyter Notebook")
        IN_NOTEBOOK = True
    else:
        print("Running in IPython shell")
except NameError:
    print("Running in standard Python shell")
Member

This looks like a handy utility function in the making. Consider elevating this to be reusable across other notebooks without duplication.
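For example, a minimal sketch of such a helper (the module name and placement are assumptions):

    # e.g. utils/environment.py (hypothetical location)
    def in_notebook() -> bool:
        """Return True when running inside a Jupyter notebook kernel."""
        try:
            from IPython import get_ipython
            return get_ipython().__class__.__name__ == "ZMQInteractiveShell"
        except (NameError, ImportError):
            return False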

Collaborator Author

Fixed!

Comment on lines 54 to 57
git_root = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"], text=True
).strip()
config_path = pathlib.Path(git_root) / "config.yml"
Member

To avoid a subprocess call (which can become a challenge), consider a pattern whereby the Jupyter notebook has access to the project root by default. This can be configured with a settings.json file if you run notebooks in VS Code. Alternatively, if using the Jupyter web interface, run Jupyter from the root of the project repo and navigate to the notebook.
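For example, with the VS Code Jupyter extension, a settings.json entry along these lines pins the notebook working directory to the workspace root (a sketch, assuming a workspace-level .vscode/settings.json):

    {
        "jupyter.notebookFileRoot": "${workspaceFolder}"
    }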

Collaborator Author

Configured the default working directory of notebooks to be the project root and dropped the subprocess call!

Comment on lines +59 to +60
if not config_path.exists():
    raise FileNotFoundError(f"Config file not found at: {config_path}")
Member

Consider making use of pathlib.Path.resolve(strict=True), possibly embedded in the variable assignment above.
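A sketch of what that could look like folded into the assignment (same behavior: raises FileNotFoundError if the file is missing):

    config_path = (pathlib.Path(git_root) / "config.yml").resolve(strict=True)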

Comment on lines +67 to +68
data_cfg = config.get("data")
if not data_cfg:
Member

Consider the use of the walrus operator here to help check and assign at the same time.

Suggested change
-data_cfg = config.get("data")
-if not data_cfg:
+if not (data_cfg := config.get("data")):

Comment on lines +77 to +82
for key in required_keys:
    value = data_cfg.get(key)
    if value is None:
        results.append((key, None, "Missing in config"))
        errors.append(f"Config key '{key}' is missing")
        continue
Member

Seeing these checks made me wonder if it might make sense to use YAML schema validation. It's common to use jsonschema for this because YAML is a superset of JSON. If you move in this direction, it takes a lot of the guesswork out of validating whether you have an object of a certain structure in YAML or JSON.
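A minimal sketch of that approach (the schema contents here are illustrative assumptions, not the notebook's actual keys):

    import jsonschema
    import yaml

    # hypothetical schema describing the expected config structure
    CONFIG_SCHEMA = {
        "type": "object",
        "properties": {
            "data": {
                "type": "object",
                "properties": {"prism_dir": {"type": "string"}},
                "required": ["prism_dir"],
            },
        },
        "required": ["data"],
    }

    with open("config.yml") as f:
        config = yaml.safe_load(f)

    # raises jsonschema.ValidationError with a descriptive message on mismatch
    jsonschema.validate(instance=config, schema=CONFIG_SCHEMA)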

Collaborator Author

Fixed using JSON Schema!

label.set_rotation(90)

plt.tight_layout()
plt.show()
Member

Would it be possible to save the plots? It might make comparisons clearer as things proceed. This might mean making the notebook check more flexible.
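A sketch of one way to do this, reusing the IN_NOTEBOOK flag from earlier (output directory and filename are hypothetical):

    import pathlib
    import matplotlib.pyplot as plt

    fig_dir = pathlib.Path("figures")
    fig_dir.mkdir(parents=True, exist_ok=True)
    plt.tight_layout()
    plt.savefig(fig_dir / "dataset_composition.png", dpi=300, bbox_inches="tight")
    if IN_NOTEBOOK:
        plt.show()
    else:
        plt.close()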

Collaborator Author

Made saving plots the default in both notebook and script mode; showing the plot is only skipped in script mode!

Comment on lines +6 to +8
api:
  openai:
    key: "YOUR_OPENAI_API_KEY"
Member

Depending on how this key is loaded, consider using python-dotenv to help keep the key out of a file checked into source control. Otherwise, or maybe either way, be sure to gitignore this file.
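A minimal sketch of the python-dotenv pattern (the environment variable name is an assumption):

    # pip install python-dotenv; keep the key in a gitignored .env file
    import os
    from dotenv import load_dotenv

    load_dotenv()  # loads variables from .env into the process environment
    openai_key = os.environ["OPENAI_API_KEY"]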

Collaborator Author

Will decide whether to adopt this change in the future if more environment variables become necessary.

Comment on lines 15 to 23
script_path = repo_root / "analysis" / "0.data_wrangling" / "nbconverted" / "0.1.wrangle_depmap_prism_data.py"

# Run from repo root so `git rev-parse --show-toplevel` and config.yml resolve
result = subprocess.run(
    ["python", str(script_path)],
    cwd=repo_root,
    capture_output=True,
    text=True,
    env={**os.environ},  # inherit env
Member

Instead of using a subprocess to run this as a script, consider using Pythonic imports. If you install the work as part of the venv, you can use something like from analysis import wrangle. This might change the way you name or store things but would make the work clearer, more Pythonic, and likely faster (importing the code directly is faster than waiting for a new Python interpreter call to complete).
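A sketch of what the test could look like under that approach (module and function names are hypothetical; assumes the package is installed into the venv, e.g. via pip install -e .):

    from analysis.data_wrangling import wrangle_depmap_prism

    def test_wrangle_depmap_prism_runs():
        # importing and calling directly avoids spawning a second interpreter
        wrangle_depmap_prism()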

/data/processed/processed_depmap_prism_ic50.csv

# Actual config.yml
config.yml
Member

Consider adding this file's details to the README.

wli51 (Collaborator Author) commented Oct 2, 2025

Thanks for reviewing, Dave. Merging!

wli51 merged commit c40a11d into WayScience:main Oct 2, 2025